ImputAccur: fast and user-friendly calculation of genotype-imputation accuracy-measures

Background
ImputAccur is a software tool to measure genotype-imputation accuracy. Imputation of untyped markers is a standard approach in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, high accuracy for imputed genotypes is fundamental. Several accuracy measures have been proposed, but they are implemented on different platforms, which makes it impractical to compare or combine them.

Results
With ImputAccur, the accuracy measures info, Iam-hiQ and r2-based indices can be derived from standard output files of imputation software. Sample/probe and marker filtering is possible. This allows, e.g., accurate marker filtering ahead of data analysis.

Conclusions
The source code (Python version 3.9.4), a standalone executable file, and example data for ImputAccur are freely available at https://gitlab.gwdg.de/kolja.thormann1/imputationquality.git.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-022-04863-z.

Furthermore, ImputAccur classifies markers as being located in a "cold", "tepid", "hot", or "very hot" region, the last indicating massively inaccurate imputation, as outlined by Rosenberger et al. [6]. Details and equations of these accuracy indices and the classification are summarized in Additional file 1. The validity of the calculations was tested by comparison with output files from IMPUTE2 (for info) and with known results of carefully selected sample data (all indices).

Implementation
ImputAccur requires the user to provide marker information (leading information) along with the estimated a-posteriori genotype probabilities (dosages) as an input file, which is a plain text file (optionally zipped). These are standard files generated by imputation software. Each row contains information on one marker. The second and third columns should contain [2] a unique marker name and [3] its physical position (e.g. on the chromosome). Probabilities for the genotypes 0, 1, and 2 of each sample/individual can be contained in 3 columns (summing to 1) or 2 columns (amended to 1). Missing or inaccurate imputations are indicated by negative values. Hence, the number (no.) of rows in the input file equals the no. of genomic markers; the no. of columns equals the no. of leading columns plus 2 or 3 times the no. of samples/individuals.
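The row layout described above can be illustrated with a minimal sketch. This is not ImputAccur's own code: the function names are hypothetical, whitespace-separated columns are assumed, and for the 2-column format it is assumed that the two columns hold the probabilities of genotypes 0 and 1, with the probability of genotype 2 taken as the remainder.

```python
def parse_marker_row(line, n_leading=3, probs_per_sample=3):
    """Split one row into leading fields and per-sample genotype probabilities."""
    fields = line.split()  # assumption: whitespace-separated columns
    leading = fields[:n_leading]
    values = [float(x) for x in fields[n_leading:]]
    if len(values) % probs_per_sample != 0:
        raise ValueError("column count does not match the declared layout")
    samples = []
    for i in range(0, len(values), probs_per_sample):
        chunk = values[i:i + probs_per_sample]
        if any(v < 0 for v in chunk):
            samples.append(None)  # negative values mark a missing imputation
            continue
        if probs_per_sample == 2:
            # assumption: columns are P(0) and P(1); P(2) is the remainder
            chunk = chunk + [1.0 - sum(chunk)]
        samples.append(tuple(chunk))
    return leading, samples


def minor_allele_frequency(samples):
    """MAF derived from dosages: mean expected allele count divided by 2."""
    called = [s for s in samples if s is not None]
    if not called:
        return None
    freq = sum(p1 + 2.0 * p2 for _, p1, p2 in called) / (2.0 * len(called))
    return min(freq, 1.0 - freq)


leading, samples = parse_marker_row("9 rsX 100 1 0 0 0 1 0 1 0 0 -1 -1 -1")
maf = minor_allele_frequency(samples)
```

In this toy row, three samples are called and one is missing, so the MAF is based on the three called samples only.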
Basic settings for program control (e.g. name and path of the input file) and the structure of the input file (e.g. number of leading columns, 2 or 3 genotype probabilities per sample) can be defined in an additional parameter file (params.txt). There is also the option to specify files containing either markers (matched by marker name) or samples/individuals (numbers matching the column order in the input file) to be excluded from the calculation. One can also provide names for the leading columns; however, the second and third columns will always be named "SNP" and "position".
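How the two exclusion lists act on the data can be sketched as follows. The function names are hypothetical, and sample numbers are assumed to be 1-based in column order; the actual file formats and numbering used by ImputAccur are described in its documentation, not here.

```python
def apply_exclusions(marker_names, n_samples, excluded_markers=(), excluded_samples=()):
    """Return the marker names and 1-based sample numbers that remain."""
    excluded_markers = set(excluded_markers)
    excluded_samples = set(excluded_samples)
    kept_markers = [m for m in marker_names if m not in excluded_markers]
    kept_samples = [i for i in range(1, n_samples + 1) if i not in excluded_samples]
    return kept_markers, kept_samples


def sample_columns(sample_number, n_leading=3, probs_per_sample=3):
    """0-based column indices that hold the probabilities of one sample."""
    start = n_leading + (sample_number - 1) * probs_per_sample
    return list(range(start, start + probs_per_sample))


kept_markers, kept_samples = apply_exclusions(
    ["m1", "m2", "m3"], 4, excluded_markers=["m2"], excluded_samples=[1, 4]
)
```

Markers are matched by name and samples by position, which mirrors the distinction made in the text above.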

Launching the application
To invoke the Python code of ImputAccur with a parameter file, the user may enter "python [NAME OF PROGRAM].py -f params.txt" in a terminal (e.g. on Ubuntu) after navigating to the folder containing the program and the parameter file. Alternatively, ImputAccur can be started without the parameter file, either as an executable file (ImputAccur.exe, e.g. on a Windows operating system) or with the command "python [NAME OF PROGRAM].py" in the corresponding folder. In both cases, the program will then ask for the parameters to be entered interactively, one at a time.
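The two start modes (parameter file via "-f", or interactive entry) can be sketched with Python's standard argparse module. This is only an illustration of the described behaviour, not the actual ImputAccur source; the function names and the returned labels are hypothetical.

```python
import argparse


def build_parser():
    """Command-line parser with an optional -f parameter file."""
    parser = argparse.ArgumentParser(description="Sketch of a launcher with an optional parameter file")
    parser.add_argument("-f", dest="param_file", default=None,
                        help="path to the parameter file, e.g. params.txt")
    return parser


def resolve_mode(argv):
    """Decide whether parameters come from a file or are asked for interactively."""
    args = build_parser().parse_args(argv)
    if args.param_file:
        return ("file", args.param_file)
    return ("interactive", None)


mode_with_file = resolve_mode(["-f", "params.txt"])
mode_without_file = resolve_mode([])
```

In the interactive case, a real program would go on to prompt for each parameter in turn.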

Runtime/performance
ImputAccur needed less than 0.8 s per marker to calculate the accuracy indices based on dosages of 10,000 probes/individuals. This was carried out on the High Performance Computing (HPC) cluster of the University of Göttingen/GWDG (https://www.gwdg.de/hpc). The calculation took less than 0.08 s per marker for 1000 probes and less than 0.008 s for 100 probes, i.e. the runtime scaled approximately linearly with the number of probes. We assessed the performance of ImputAccur on Scientific Linux and on Ubuntu 18.04.6 with Python 3.9.4, 3.7.3, and 3.6.13, as well as on Windows 10 Pro Build 21H2.
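For planning purposes, the reported timings can be turned into a rough upper-bound estimate. The sketch below assumes that the linear relation between runtime and number of probes holds beyond the reported range and on other hardware, which the source does not guarantee; the function name is hypothetical.

```python
SECONDS_PER_MARKER_AT_10K = 0.8  # reported upper bound for 10,000 probes/individuals


def estimated_runtime_seconds(n_markers, n_samples):
    """Rough upper bound, assuming runtime is linear in markers and samples."""
    per_marker = SECONDS_PER_MARKER_AT_10K * n_samples / 10_000
    return n_markers * per_marker


one_marker_1000_probes = estimated_runtime_seconds(1, 1000)
thousand_markers_10k_probes = estimated_runtime_seconds(1000, 10_000)
```

Under these assumptions, 1000 markers with 10,000 probes would need at most about 800 s.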

Results/example
Assume your input file (see example test1 in Additional file 1) contains information on 7 SNPs in three leading columns and 3 genotype probabilities for each of 5 samples/individuals. Hence, the file has 7 rows and 3 + 3 × 5 = 18 columns. Because the SNPs rs00001, rs00003, and rs00004 are quality control markers, these are listed in the file exclude_SNP.txt. Because individuals 1 and 2 are external controls, these are listed in exclude_PROBE.txt. The input file is test1.imputed; the program parameters describing this layout and the two exclusion files are set in params.txt or entered during the execution.
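The dimensions of this example can be checked with a few lines of arithmetic. The function name below is hypothetical; the numbers are those given in the text.

```python
def expected_columns(n_leading, probs_per_sample, n_samples):
    """Number of columns implied by the declared input layout."""
    return n_leading + probs_per_sample * n_samples


n_rows = 7                                # one row per SNP
n_columns = expected_columns(3, 3, 5)     # 3 leading + 3 probabilities x 5 samples
markers_analysed = n_rows - 3             # rs00001, rs00003, rs00004 excluded
samples_analysed = 5 - 2                  # individuals 1 and 2 excluded
```

After both exclusion lists are applied, 4 markers and 3 individuals remain for the calculation of the indices.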

Output interpretation (for marker rs00005)
A-posteriori genotype probabilities of N = 5 individuals were contained in the input file for SNP rs00005 (fifth in the input file) at position 221 on the considered chromosome. A frequency (MAF) of 26% for the minor allele can be derived from these dosages. According to the accuracy indices Iam chance (0.546) and Iam HWE (0.448), it is reasonable to assume that about half of the information contained in dosages comes from the true (but unknown) genotypes of the individuals in the sample; the other half comes from the population used as a reference for genotype imputation. The values are borderline near the recommended threshold of 0.47 [6]. The difference between Iam chance and Iam HWE is the "anchor point" used, which is either purely population-related dosages (Hardy-Weinberg Equilibrium HWE, taking MAF into account) or pure chance (1/3 probability for each of the three possible genotypes of a SNP). Iam chance and Iam HWE usually have comparable values [6].
Fig. 1 (caption, partially recovered): each dot represents one imputed marker; the marker size is according to minor allele frequency (MAF); threshold values are freely selectable; vertical lines: centre of region classified as "cold", "tepid", "hot", or "very hot" (the definition is given in Additional file 1).
A value of 0.099 for info indicates that only 10% of the statistical information on the minor population allele frequency (MAF), given "known" genotypes, remained after the genotypes had been imputed for rs00005 [3]. For the measure info, threshold values such as 0.8 or 0.3 have been proposed, but without sound justification [3,7,8].
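To make the interpretation of info more concrete, the sketch below implements the commonly cited IMPUTE-style definition of the measure: one minus the ratio of the summed per-sample genotype variances to the variance expected from the allele frequency alone. It is assumed here that ImputAccur's info follows this form; the exact equations are those in Additional file 1, and the function name is hypothetical.

```python
def impute_style_info(probs):
    """IMPUTE-style info from a list of (P(0), P(1), P(2)) tuples, one per sample."""
    n = len(probs)
    e = [p1 + 2.0 * p2 for _, p1, p2 in probs]   # expected allele count (dosage)
    f = [p1 + 4.0 * p2 for _, p1, p2 in probs]   # expected squared allele count
    theta = sum(e) / (2.0 * n)                   # estimated allele frequency
    if theta <= 0.0 or theta >= 1.0:
        return 1.0                               # monomorphic: no information to lose
    residual = sum(fi - ei * ei for ei, fi in zip(e, f))
    return 1.0 - residual / (2.0 * n * theta * (1.0 - theta))


# Genotypes known with certainty: full information is retained
info_certain = impute_style_info([(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)])
# Every sample carries only the population-level (HWE) probabilities
info_uninformative = impute_style_info([(0.25, 0.5, 0.25)] * 4)
```

The two toy inputs mark the ends of the scale: certain genotypes give 1, whereas dosages that merely repeat the population frequencies give 0.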
A value of 0.739 for hiQ indicates insufficient heterogeneity of dosages across all samples/individuals, as it is lower than the recommended threshold of 0.97 [6]. The imputation seems to have resulted in dosages too similar to be used for statistical inference testing.
Since r2 MACH has a value of 0.319, one can conclude that the power of an allelic test, in the case of a binary trait, based on the imputed genotypes of rs00005 is approximately √0.319 ≈ 0.56 times that of the same test if all genotypes were present. The same applies for r2 Beagle [2]. Overall, all indices identify rs00005 as a marker with poor imputation. The "accuracy" is rated as "hot", indicating that rs00005 is also located in a genomic region enriched with markers of poor imputation.
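The r2 MACH index and the square-root relation used above can be illustrated as follows. The sketch uses the commonly cited definition of the MACH measure, i.e. the empirical variance of the dosages divided by the variance 2p(1 − p) expected for known genotypes; whether ImputAccur applies exactly this form is an assumption here (see Additional file 1), and the function name is hypothetical.

```python
import math


def mach_rsq(dosages):
    """Ratio of observed dosage variance to the variance expected at frequency p."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                                   # estimated allele frequency
    if p <= 0.0 or p >= 1.0:
        return None                                  # undefined for monomorphic markers
    variance = sum(d * d for d in dosages) / n - mean * mean
    return variance / (2.0 * p * (1.0 - p))


rsq_certain = mach_rsq([0.0, 1.0, 2.0, 1.0])   # certain genotypes
rsq_flat = mach_rsq([1.0, 1.0, 1.0, 1.0])      # identical dosages for all samples
relative_power = round(math.sqrt(0.319), 2)    # value quoted in the text
```

Identical dosages across all samples yield an index of 0, which corresponds to the situation described for markers with insufficient heterogeneity of dosages.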
In addition, Fig. 1 illustrates the regional accuracy of all indices, including the classification from "cold" to "very hot". The telomere region of 0 to 17.5 kb of chromosome 9 is plotted. One can easily see that the imputation is not accurate near the ends of each sister chromatid. One can also see the differences between the accuracy indices, especially for rare markers (low MAF, shown as small dots), and realize how critical the choice of appropriate thresholds can be (including the classification implemented in ImputAccur). The example data used are described elsewhere [6].

Conclusion
ImputAccur is easy-to-use software for determining multiple measures of accuracy for imputed genotypes, independently of the imputation platform used. This allows greater flexibility in post-imputation variant filtering. Because it also delivers a regional classification, poorly imputed chromosome segments may be identified.

GWAS: Genome-wide association studies
SNP: Single nucleotide polymorphism
MAF: Minor allele frequency
Dosages: Estimated a-posteriori genotype probabilities
HWE: Hardy-Weinberg equilibrium